Bayesian Hierarchical Cross-Clustering
نویسندگان
چکیده
Most clustering algorithms assume that all dimensions of the data can be described by a single structure. Cross-clustering (or multiview clustering) allows multiple structures, each applying to a subset of the dimensions. We present a novel approach to crossclustering, based on approximating the solution to a Cross Dirichlet Process mixture (CDPM) model [Shafto et al., 2006, Mansinghka et al., 2009]. Our bottom-up, deterministic approach results in a hierarchical clustering of dimensions, and at each node, a hierarchical clustering of data points. We also present a randomized approximation, based on a truncated hierarchy, that scales linearly in the number of levels. Results on synthetic and real-world data sets demonstrate that the cross-clustering based algorithms perform as well or better than the clustering based algorithms, our deterministic approaches models perform as well as the MCMC-based CDPM, and the randomized approximation provides a remarkable speedup relative to the full deterministic approximation with minimal cost in predictive error.
منابع مشابه
A Hierarchical Distance-dependent Bayesian Model for Event Coreference Resolution
We present a novel hierarchical distancedependent Bayesian model for event coreference resolution. While existing generative models for event coreference resolution are completely unsupervised, our model allows for the incorporation of pairwise distances between event mentions — information that is widely used in supervised coreference models to guide the generative clustering processing for be...
متن کاملBayesian coclustering of Anopheles gene expression time series: study of immune defense response to multiple experimental challenges.
We present a method for Bayesian model-based hierarchical coclustering of gene expression data and use it to study the temporal transcription responses of an Anopheles gambiae cell line upon challenge with multiple microbial elicitors. The method fits statistical regression models to the gene expression time series for each experiment and performs coclustering on the genes by optimizing a joint...
متن کاملBayesian Hierarchical Modelling for Tailoring Metric Thresholds
Software is highly contextual. While there are cross-cutting ‘global’ lessons, individual software projects exhibit many ‘local’ properties. This data heterogeneity makes drawing local conclusions from global data dangerous. A key research challenge is to construct locally accurate prediction models that are informed by global characteristics and data volumes. Previous work has tackled this pro...
متن کاملRandomized Algorithms for Fast Bayesian Hierarchical Clustering
We present two new algorithms for fast Bayesian Hierarchical Clustering on large data sets. Bayesian Hierarchical Clustering (BHC) [1] is a method for agglomerative hierarchical clustering based on evaluating marginal likelihoods of a probabilistic model. BHC has several advantages over traditional distancebased agglomerative clustering algorithms. It defines a probabilistic model of the data a...
متن کاملAgglomerative Clustering of Bagged Data Using Joint Distributions
Current methods for hierarchical clustering of data either operate on features of the data or make limiting model assumptions. We present the hierarchy discovery algorithm (HDA), a model-based hierarchical clustering method based on explicit comparison of joint distributions via Bayesian network learning for predefined groups of data. HDA works on both continuous and discrete data and offers a ...
متن کامل